PREDICT CUSTOMER RESPONSE TO A PROMOTIONAL OFFER¶

Project Goals¶

  • What is the average age of the Starbucks customer?
  • Which are the most successful offers?
  • Determine the success of each offer by gender and age group.
  • Build an ML model that predicts whether a customer will complete an offer after viewing it, given the customer's demographics and the offer's characteristics.

Datasets¶

The data is contained in three files:

  • portfolio.json - containing offer ids and other attributes about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Table of Contents¶

  • 1. Loading Data
  • 2. Data Exploration
  • 3. Data Cleaning
  • 4. Exploratory Data Analysis
  • 5. Feature Engineering

Import all necessary libraries

In [1]:
import pandas as pd
import json
import numpy as np
import datetime
from tqdm.notebook import tqdm
import pickle
import time
In [2]:
# importing sys
import sys
 
# adding scripts to the system path
sys.path.insert(0, '/Users/fidahussainrao/Downloads/Predict customer response/scripts')
In [3]:
from eda import *

import plotly.io as pio
pio.renderers.default='notebook'

import warnings
warnings.filterwarnings('ignore')

1. Loading Data¶

In [4]:
def load_data(file):
    return pd.read_json(f'../data/{file}.json', orient='records', lines=True)

portfolio = load_data('portfolio')
profile = load_data('profile')
transcript = load_data('transcript')

2. Data Exploration¶

a. Offer Characteristics¶

In [5]:
explore_df(portfolio)
Shape of the dataframe is: (10, 6)

There are no missing values in this dataframe.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
None

          reward  difficulty   duration
count  10.000000   10.000000  10.000000
mean    4.200000    7.700000   6.500000
std     3.583915    5.831905   2.321398
min     0.000000    0.000000   3.000000
25%     2.000000    5.000000   5.000000
50%     4.000000    8.500000   7.000000
75%     5.000000   10.000000   7.000000
max    10.000000   20.000000  10.000000

Out[5]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
In [6]:
# unique offer types
portfolio['offer_type'].unique()
Out[6]:
array(['bogo', 'informational', 'discount'], dtype=object)
In [7]:
cols = ['reward', 'difficulty', 'duration']
plot_hist(portfolio, cols)

The above exploration shows that:

  1. The channels column contains lists of values, so it will have to be expanded later during the data cleaning stage.
  2. No missing values in this dataset.
  3. There are three types of offers, those are:
    • BOGO - Buy One Get One
    • Informational - provides information about the products
    • Discount - discount on purchases
  4. The most common difficulty (the minimum spend required to complete the offer) is 10, and the most common duration (the period for which the offer is valid) is 7.

b. User Demographics¶

In [8]:
explore_df(profile)
Shape of the dataframe is: (17000, 5)

Missing values
gender              2175
age                    0
id                     0
became_member_on       0
income              2175
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
None

                age  became_member_on         income
count  17000.000000      1.700000e+04   14825.000000
mean      62.531412      2.016703e+07   65404.991568
std       26.738580      1.167750e+04   21598.299410
min       18.000000      2.013073e+07   30000.000000
25%       45.000000      2.016053e+07   49000.000000
50%       58.000000      2.017080e+07   64000.000000
75%       73.000000      2.017123e+07   80000.000000
max      118.000000      2.018073e+07  120000.000000

Out[8]:
gender age id became_member_on income
0 None 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 None 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 None 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [9]:
profile['gender'].unique()
Out[9]:
array([None, 'F', 'M', 'O'], dtype=object)
In [10]:
cols = ['age', 'income']
plot_hist(profile, cols, bins=50)

The exploration above shows that:

  1. The maximum age is 118, which is likely an outlier (a placeholder value).
  2. There are missing values in the gender and income columns.
  3. There are three gender categories: Male, Female, and Other.
  4. 75% of the users are under 73 years old and earn up to $80,000.
  5. The became_member_on column will be feature-engineered into membership tenure in days.

c. Transactional Records¶

In [11]:
explore_df(transcript)
Shape of the dataframe is: (306534, 4)

There are no missing values in this dataframe.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
None

                time
count  306534.000000
mean      366.382940
std       200.326314
min         0.000000
25%       186.000000
50%       408.000000
75%       528.000000
max       714.000000

Out[11]:
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
In [12]:
## there is a common key in the value column for every event.

for event in transcript['event'].unique():
    key_ = []
    for val in transcript[transcript['event'] == event]['value']:
        key_.append(val.keys())
    
    print(f'The common key(s) in the values column that corresponds to {event.capitalize()} in the event column is/are:')
    j = 0
    for k in list(np.unique(key_)[0]):
        print(f'{j+1} - {k.capitalize()}')
        j+=1
    print()
The common key(s) in the values column that corresponds to Offer received in the event column is/are:
1 - Offer id

The common key(s) in the values column that corresponds to Offer viewed in the event column is/are:
1 - Offer id

The common key(s) in the values column that corresponds to Transaction in the event column is/are:
1 - Amount

The common key(s) in the values column that corresponds to Offer completed in the event column is/are:
1 - Offer_id
2 - Reward

In [13]:
list(np.unique(key_)[0])
Out[13]:
['offer_id', 'reward']
In [14]:
plt.figure(figsize=(12,5))
plt.hist(transcript['time'], bins=50, color='lightblue')
plt.grid(axis='y', alpha=0.75)
plt.title('TIME');

The exploration above shows that:

  1. There are 4 different events.
  2. There are 306,534 transactional records in total.
  3. The value column contains different attributes depending on the event; it will be processed per event in later stages.
  4. The data is sorted by the time column.

3. Data Cleaning¶

a. Cleaning the portfolio dataframe¶

  • One-hot encode the 'channels' column
  • Add offer number to every offer.
  • Rename the 'id' column to 'offer_id'
In [15]:
## cleaning the channels column, creating dummy variables.
def one_hot_encode(item_lists):
    
    # get all the unique channels
    unique_items = []
    for item_list in item_lists:
        for item in item_list:
            unique_items.append(item)
    unique_items = list(set(unique_items))
    
    # Create empty dict
    item_dict = {}
        
    # Loop through all the channels
    for item in unique_items:
        
        # Apply boolean mask
        item_dict[item] = item_lists.apply(lambda x: item in x)
            
    # Return the results as a dataframe
    return pd.DataFrame(item_dict)
In [16]:
portfolio = portfolio[['id', 'offer_type', 'duration', 'difficulty', 'reward', 'channels']]
# one hot encode the channels column, and concatenate it with the remaining dataframe.
portfolio = pd.concat([portfolio, one_hot_encode(portfolio['channels'])],axis=1)
portfolio.iloc[:,-4:] = portfolio.iloc[:,-4:].astype('int')
portfolio = portfolio.drop('channels',axis=1)
In [17]:
portfolio.head()
Out[17]:
id offer_type duration difficulty reward mobile email web social
0 ae264e3637204a6fb9bb56bc8210ddfd bogo 7 10 10 1 1 0 1
1 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 5 10 10 1 1 1 1
2 3f207df678b143eea3cee63160fa8bed informational 4 0 0 1 1 1 0
3 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 7 5 5 1 1 1 0
4 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 10 20 5 0 1 1 0
In [18]:
portfolio = portfolio.reset_index()
portfolio = portfolio.rename(columns={'index': 'offer_num'})
portfolio = portfolio.rename(columns={'id': 'offer_id'})

b. Cleaning the profile dataframe¶

  • The gender and income values are missing exactly for the rows where age = 118, so it is reasonable to drop those rows: they carry no meaningful demographic information for the model, and an age of 118 is almost certainly a placeholder value.
  • Transform the 'became_member_on' column to a datetime object.
  • Add membership start year and start month columns. (for further analysis)
  • Calculate the number of days since the membership started.
  • Change the name of the 'id' column to 'person_id'.
In [19]:
profile[profile['age'] == profile['age'].max()].isna().sum()
Out[19]:
gender              2175
age                    0
id                     0
became_member_on       0
income              2175
dtype: int64
In [20]:
profile = profile[profile['age'] != profile['age'].max()]
In [21]:
profile.isna().sum()
Out[21]:
gender              0
age                 0
id                  0
became_member_on    0
income              0
dtype: int64
In [22]:
## cleaning the date column
profile['became_member_on'] = profile['became_member_on'].apply(lambda x:datetime.datetime.strptime(str(x), '%Y%m%d'))
In [23]:
profile['membership_month'] = profile['became_member_on'].dt.month
profile['membership_year'] = profile['became_member_on'].dt.year
profile['membership_day'] = profile['became_member_on'].dt.day

profile['member_since'] = (datetime.datetime.now() - profile['became_member_on']).dt.days
In [24]:
profile = profile.drop('became_member_on', axis=1)
In [25]:
profile = profile.rename(columns={'id': 'person_id'})

c. Cleaning the transcript dataframe¶

  • Change the name of the 'person' column to 'person_id'
  • Remove person IDs that are not in the customer profile dataframe
  • Convert the unit of time from hours to days.
  • To gain insights from the value column, I will need to split the values into distinct columns based on the event.
    1. Extract the offer_id, amount, and reward values from the value column.
    2. Clean up the offer id and offer_id columns and merge them into one column.
    3. Finally, concatenate these values with the transcript dataframe.
In [26]:
transcript = transcript.rename(columns={'person': 'person_id'})
In [27]:
transcript = pd.merge(transcript, profile['person_id'], on='person_id', how='inner')
In [28]:
# convert time from hours to days 
transcript['time'] = transcript['time'] / 24
In [29]:
transcript['event'].unique()
Out[29]:
array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
      dtype=object)
In [30]:
offers = pd.DataFrame.from_records(transcript['value'])
offers.head()
Out[30]:
offer id amount offer_id reward
0 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN NaN
1 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN NaN
2 NaN 19.89 NaN NaN
3 NaN NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 5.0
4 NaN 17.78 NaN NaN
In [31]:
offers['offer_id'] = np.where(offers['offer id'].isnull() & offers['offer_id'].notnull(), 
                              offers['offer_id'], offers['offer id'])
offers = offers.drop('offer id', axis=1)
In [32]:
offers.head(3)
Out[32]:
amount offer_id reward
0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
1 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
2 19.89 NaN NaN
In [33]:
transcript = pd.concat([transcript.drop('value',axis=1),offers],axis=1)
transcript.head()
Out[33]:
person_id event time amount offer_id reward
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0.00 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
1 78afa995795e4d85b5d9ceeca43f5fef offer viewed 0.25 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
2 78afa995795e4d85b5d9ceeca43f5fef transaction 5.50 19.89 NaN NaN
3 78afa995795e4d85b5d9ceeca43f5fef offer completed 5.50 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 5.0
4 78afa995795e4d85b5d9ceeca43f5fef transaction 6.00 17.78 NaN NaN
In [ ]:
 

4. Exploratory Data Analysis¶

1. What is the average age of the Starbucks' customer?¶

In [34]:
profile['age'].describe()
Out[34]:
count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000
Name: age, dtype: float64

An average Starbucks customer is 54.4 years old.

In [35]:
plt.figure(figsize=(15,6))
profile['age'].plot.hist(grid=True, bins=profile['age'].nunique(), rwidth=0.9,
                   color='lightslategray')

plt.title('What is the average age of the customer?')
plt.xlabel('Age')
plt.ylabel('Count')
plt.vlines(x = profile['age'].mean(), ymin=0, ymax=450, linestyles='--', linewidth=3, color='k')
plt.text(profile['age'].mean()+.5, 440, 'Mean', ha ='left', va ='center')


plt.grid(axis='y', alpha=0.75)

# gaussian shape centered on the mean value
In [ ]:
 

2. What is the gender distribution of Starbucks customers?¶

In [36]:
# getting the normalized gender distribution
gender_dist = 100 * profile['gender'].value_counts(normalize=True).sort_values()
gender_dist
Out[36]:
O     1.430017
F    41.342327
M    57.227656
Name: gender, dtype: float64
In [37]:
x = gender_dist.index
y = gender_dist
plot_bar_chart(x, y, title='Normalized gender distribution of Starbucks customers (%)')

3. Checking the event distribution¶

Analyzing the total counts of the 'offer completed', 'offer viewed', 'offer received', and 'transaction' events.

In [38]:
event_dist = transcript['event'].value_counts().sort_values()
x = event_dist.index
y = event_dist
plot_bar_chart(x, y, title='Event Distribution')

4. Which are the most successful offers?¶

An offer's success rate can be computed by dividing the total number of 'offer completed' events by the number of 'offer received' events for that offer. The results are plotted on a bar chart.

In [39]:
# getting the total instances of each event.
# To get the total number of events for each offer id, we will group by both event and offer id.
offersbyEvent = transcript.groupby(['event', 'offer_id']).count().iloc[:,0].reset_index()
In [40]:
# use pivot function to obtain the table in the required form. 
# drop nan values if any.
offersbyEvent = pd.pivot(data=offersbyEvent, index='offer_id', 
                         columns='event', values='person_id' )[['offer completed', 'offer received']].dropna()
In [41]:
# dividing offer completed by offer received for each offer.
offersbyEvent = offersbyEvent['offer completed'] / offersbyEvent['offer received']
In [42]:
# create a dictionary to map the offer id to their respective offer number. 
id_num = dict(zip(portfolio.offer_id, portfolio.offer_num))
In [43]:
offersbyEvent.index = offersbyEvent.index.map(id_num)
In [44]:
offersbyEvent = offersbyEvent.sort_values(ascending=False)
In [45]:
x = offersbyEvent.index.astype(str)
y = offersbyEvent * 100
title='Which are the most Successful offers?'
plot_bar_chart(x, y, title)

The bar chart above shows that the most successful offer in the portfolio is offer 6, followed by offers 5, 8, and 3.

5. Most successful offers for each age group¶

In this section, I will calculate the success rate of each offer by age group, using the same method as in the previous section. Finally, I will visualise the results with a grouped bar chart, wrapping it in a helper function in case it is needed again later.

In [46]:
profile['age'].describe()
Out[46]:
count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000
Name: age, dtype: float64

Computing success rates for every individual age is computationally expensive and hard to interpret, so we will bin customers' ages into decades.

In [47]:
profile['age_group'] = pd.cut(profile['age'], bins=range(10,111,10) )
profile['age_group'].unique()
Out[47]:
[(50, 60], (70, 80], (60, 70], (20, 30], (40, 50], (30, 40], (90, 100], (10, 20], (80, 90], (100, 110]]
Categories (10, interval[int64, right]): [(10, 20] < (20, 30] < (30, 40] < (40, 50] ... (70, 80] < (80, 90] < (90, 100] < (100, 110]]
In [48]:
# create a dictionary to map person_id to their respective age groups
id_agegroup = dict(zip(profile.person_id, profile.age_group))
transcript['age'] = transcript['person_id'].map(id_agegroup)
In [49]:
print(transcript['age'].value_counts())
transcript.head()
(50, 60]      64707
(60, 70]      52308
(40, 50]      43611
(30, 40]      31532
(70, 80]      29216
(20, 30]      27037
(80, 90]      13855
(10, 20]       6544
(90, 100]      3854
(100, 110]       98
Name: age, dtype: int64
Out[49]:
person_id event time amount offer_id reward age
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0.00 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN (70, 80]
1 78afa995795e4d85b5d9ceeca43f5fef offer viewed 0.25 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN (70, 80]
2 78afa995795e4d85b5d9ceeca43f5fef transaction 5.50 19.89 NaN NaN (70, 80]
3 78afa995795e4d85b5d9ceeca43f5fef offer completed 5.50 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 5.0 (70, 80]
4 78afa995795e4d85b5d9ceeca43f5fef transaction 6.00 17.78 NaN NaN (70, 80]
In [50]:
# getting the total count of each event grouped by the offer id and the age groups.
offersbyAge = transcript.groupby(['age', 'event', 'offer_id']).count().iloc[:,0].reset_index()
In [51]:
# getting the table in the required format.
offersbyAge = pd.pivot(data=offersbyAge, index=['age','offer_id'], columns='event', values='person_id' )
In [52]:
# divide the offer completed by the offer received to obtain the success rate of each offer.
offersbyAge = (offersbyAge['offer completed'] / offersbyAge['offer received']).dropna().reset_index()
offersbyAge = offersbyAge.rename(columns={0:'success_rate'})
In [53]:
offersbyAge['offer_num'] = offersbyAge['offer_id'].map(id_num)
In [54]:
offersbyAge = offersbyAge.sort_values(by=['age', 'offer_num'])
In [55]:
title = 'Which are the most successful offers for each age group?'
plot_groupedBarChart(offersbyAge, 'age', title)

The graph above shows the most successful offer for each age group:

  • 10-20: Offer 5
  • 20-30: Offer 5
  • 30-40: Offer 6
  • 40-50: Offer 5
  • 50-60: Offer 6
  • 60-70: Offer 6
  • 70-80: Offer 5
  • 80-90: Offer 6
  • 90-100: Offer 6
  • 100-110: Offers 1, 5, 8 & 9 (tie)

6. What are the most successful offers for each gender?¶

Now, I will calculate the success rate of each offer by gender. The profile dataset has three gender categories: Male, Female, and Other. Lastly, I will visualize the results using the helper function created in the previous section.

In [56]:
# create a dictionary to map person_id to their respective genders.
id_gender = dict(zip(profile.person_id, profile.gender))
transcript['gender'] = transcript['person_id'].map(id_gender)
transcript['gender'].value_counts()
Out[56]:
M    155690
F    113101
O      3971
Name: gender, dtype: int64

We can easily calculate the success rate for each gender using the same method we used to calculate it for different age groups.

In [57]:
offersbyGender = transcript.groupby(['gender', 'event', 'offer_id']).count().iloc[:,0].reset_index()

offersbyGender = pd.pivot(data=offersbyGender, index=['gender','offer_id'], columns='event', values='person_id' )

offersbyGender = (offersbyGender['offer completed'] / offersbyGender['offer received']).dropna().reset_index()
offersbyGender = offersbyGender.rename(columns={0:'success_rate'})

offersbyGender['offer_num'] = offersbyGender['offer_id'].map(id_num)

offersbyGender = offersbyGender.sort_values(by=['gender', 'offer_num'])
In [58]:
title = 'Which are the most successful offers for each gender?'
plot_groupedBarChart(offersbyGender, 'gender', title)

The graph above shows the most successful offer for each gender:

  • Male: Offer 6
  • Female: Offer 6
  • Other: Offer 8

7. Analyze the distributions of user joining years and months.¶

In [59]:
plot_hist(profile, ['member_since', 'membership_year', 'membership_month'])
  • The majority of customers have been members for around 1,500-2,000 days.
  • 2017 saw the highest number of new customer registrations.
  • New registrations show an increasing trend from 2013 onward.
  • Monthly sign-ups are roughly constant, except for spikes in January and December.

5. Feature Engineering¶

a. Defining the target variable¶

To determine what a successful offer is, I will further explore how the three datasets interact with each other. First, let's examine which events occur for each offer type.

In [60]:
a = pd.merge(transcript[['offer_id','event','time']], 
             portfolio[['offer_id', 'offer_type']],
             on='offer_id', how='left')

a.groupby(['offer_type', 'event']).count().iloc[:,1]
Out[60]:
offer_type     event          
bogo           offer completed    15258
               offer received     26537
               offer viewed       22039
discount       offer completed    17186
               offer received     26664
               offer viewed       18461
informational  offer received     13300
               offer viewed        9360
Name: time, dtype: int64

There are four types of events in our data: offer received, offer viewed, transaction, and offer completed. However, as shown above, transactions are not associated with any offer type: in the transcript dataframe, the offer ID and offer type are null for transaction events. Consequently, we need a way to populate the offer IDs for transactions.

First, sort the transcript dataset by person and time so that each person's events appear in chronological order.

In [61]:
transcript_processed = transcript.sort_values(by=['person_id', 'time'])
transcript_processed = transcript_processed.reset_index(drop=True)
In [62]:
def clean_transactions(data):
    
    """ 
    In the transcript table, offer IDs are null when the corresponding event
    is transaction. This function will fill in the missing offer ids.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              for a customer.
        
    OUTPUT:
        data: Cleaned dataframe, with no null offer IDs.
    """
        
    data['prev_person_id'] = data['person_id'].shift() 

    data['offer_id'] = np.where((data['offer_id'].isnull()) 
                                      & (data['prev_person_id'] == data['person_id']),
                                      data['offer_id'].shift(),
                                      data['offer_id'])

    # If a customer made a transaction without ever receiving an offer, the
    # offer_id will still be NaN after the shift. Drop these records, since
    # there is no way to fill in these offer IDs; moreover, these count as
    # unsuccessful offers, since the customer purchased without any knowledge
    # of the offer.
    data = data.dropna(subset=['offer_id'])
    data = data.drop('prev_person_id', axis=1)
    return data

transcript_processed = clean_transactions(transcript_processed)
transcript_processed.isna().sum()
Out[62]:
person_id         0
event             0
time              0
amount       148805
offer_id          0
reward       188174
age               0
gender            0
dtype: int64

Successful Offers¶

It can also be noted that BOGO and discount offers generate an 'offer completed' event upon completion, whereas informational offers do not. We can now define a successful offer as follows:

Group 1

  • For the BOGO and discount offer types, the events must follow this sequence for the offer to be considered successful:

offer received -> offer viewed -> transaction -> offer completed

  • For the informational offer type to be successful, the following sequence must occur:

offer received -> offer viewed -> transaction (Informational offers must be completed within the given time duration)

Ineffective Offers¶

In the following cases, the offer is considered ineffective:

Group 2

Offer received and viewed, but no response (unsuccessful).

  • offer received -> offer viewed

Group 3

Customers who purchased or completed the offer without being aware of it: the offer is completed, or a transaction occurs, before the offer is viewed (or the completing transaction occurred before the view). The offer may technically be completed in these cases, but it is not a successful conversion.

  • a. For BOGO and Discount offer type

    • offer received -> transaction -> offer completed-> offer viewed
  • b. For Informational offer type

    • offer received -> transaction -> offer viewed

Group 4

Offer received but not viewed, hence no further action.

  • offer received
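The group taxonomy above can be sketched as a small classifier over a customer-offer event sequence. This is a minimal illustration, not the notebook's implementation: the function name classify_sequence is hypothetical, and it ignores the duration-validity check that the real pipeline applies.

```python
# Hypothetical sketch: classify one (customer, offer) event sequence into the
# groups defined above. Ignores the time-validity check used by the notebook.
def classify_sequence(events, offer_type):
    """events: ordered list of event strings for one (customer, offer) pair."""
    seq = ' -> '.join(events)
    if offer_type == 'informational':
        success = 'offer received -> offer viewed -> transaction'
    else:  # 'bogo' or 'discount'
        success = 'offer received -> offer viewed -> transaction -> offer completed'
    if success in seq:
        return 'Group 1 (successful)'
    if 'offer received -> offer viewed' in seq:
        return 'Group 2 (viewed, no response)'
    if 'offer received -> transaction' in seq:
        return 'Group 3 (completed without viewing)'
    return 'Group 4 (received only)'
```

The checks are ordered from most to least specific, so a sequence is assigned to the first group whose pattern it contains.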

ai. Getting successful offers¶

In [63]:
def valid_offer(data):
    
    """ 
    Determine whether the offer was completed within the given time duration.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              of an offer for a customer.
        
    OUTPUT:
        offer_df: Additional column called valid_offer appended to the given
                  dataframe.
    """
    
    offer_df = data.copy()
    valid = []
    # each offer has a single duration (in days), which serves as the time limit
    limit = offer_df['duration'].unique()[0]
    for idx in range(len(offer_df)):
        # the validity window restarts each time the offer is received
        if offer_df['event'].iloc[idx] == 'offer received':
            start = offer_df.iloc[idx]['time']
        end = start + limit
        valid.append(int(offer_df.iloc[idx]['time'] < end))
    offer_df['valid_offer'] = valid

    return offer_df



def define_offer_success(data, offer_type):
   
    """ 
    Determine whether the offer was successfully completed.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              of an offer for a customer.
        offer_type: type of the given offer from informational, BOGO, discount.
    OUTPUT:
        offer_df: Additional column called offer_success appended to the given
                  dataframe.
    """


    offer_df = data.copy()

    if offer_type == 'informational':
        offer_df['offer_success'] = np.where((offer_df['event'] == 'transaction') & \
                                                 (offer_df['prev_event_1'] == 'offer viewed') & \
                                                 (offer_df['prev_event_2'] == 'offer received') & \
                                                 (offer_df['valid_offer'] == True), 1,0)


    elif offer_type in ['discount', 'bogo']:
        offer_df['offer_success'] = np.where((offer_df['event'] == 'offer completed') & \
                                             (offer_df['prev_event_1'] == 'transaction') & \
                                             (offer_df['prev_event_2'] == 'offer viewed') & \
                                             (offer_df['prev_event_3'] == 'offer received') & \
                                             (offer_df['valid_offer'] == True),1,0)

    return offer_df

Some customers received the same offer multiple times but did not always complete it. Keeping every completed and uncompleted instance in the final dataframe would give the model conflicting labels for the same (customer, offer) pair. I therefore keep only one record per customer-offer pair: if the customer completed the offer at least once, it is labelled successful; otherwise it is labelled unsuccessful. In addition, I include a total-offers-received column to capture how many times the offer had to be sent before the customer completed it.

NOTE: By keeping only the total offers received, I deliberately ignore the possibility that a customer completed the same offer more than once. I considered adding a total-offers-completed column, but that could cause data leakage.
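The per-pair collapse described above can be sketched with a pandas groupby, as an illustrative alternative to the per-customer loop the notebook actually uses. The toy records dataframe and the total_received column are assumptions for illustration (here total_received simply counts rows per pair as a stand-in for 'offer received' events):

```python
import pandas as pd

# Toy example (assumed data): one row per offer instance, with a 0/1 success flag.
records = pd.DataFrame({
    'person_id': ['a', 'a', 'a', 'b'],
    'offer_id':  ['x', 'x', 'y', 'x'],
    'offer_success': [0, 1, 0, 0],
})

# Collapse to one record per (customer, offer) pair: successful if completed
# at least once (max), and count how many instances the customer received (size).
collapsed = (records
             .groupby(['person_id', 'offer_id'], as_index=False)
             .agg(offer_success=('offer_success', 'max'),
                  total_received=('offer_success', 'size')))
```

Customer 'a' received offer 'x' twice and completed it once, so the collapsed row is marked successful with total_received = 2.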

In [65]:
def offer_success():
    
    """ 
    Creates a dictionary describing the effectiveness of each offer
    for each customer. An offer is considered effective if the customer
    completed it at least once.
    OUTPUT:
        d: A dictionary containing the offer_id, person_id, and offer_success.
    """


    person_ids = transcript_processed['person_id'].unique()
    d = dict()
    d['offer_success'], d['offer_id'], d['person_id'] = [], [], []    

    for person_id in tqdm(person_ids):
        person_df = transcript_processed[transcript_processed['person_id'] == person_id]
        
        # merge with portfolio column
        person_df = pd.merge(person_df, portfolio[['offer_id', 'offer_type', 'duration']], on='offer_id', how='left')

        offer_ids = person_df['offer_id'].unique()

        for i in range(len(offer_ids)):

            offer_df = person_df[person_df['offer_id'] == offer_ids[i]]

            offer_df['prev_event_1'], offer_df['prev_event_2'], offer_df['prev_event_3'] = offer_df['event'].shift(), \
            offer_df['event'].shift(2),\
            offer_df['event'].shift(3)
            
            # check whether the offer was completed within the given time.
            offer_df = valid_offer(offer_df)
            # getting the offer type of the current offer.
            offer_type = offer_df['offer_type'].unique()[0]
            # checking whether the current offer was successful or not
            offer_df = define_offer_success(offer_df, offer_type)
            
            # add the offer_success, offer_id, and person_id to the dictionary
            d['offer_success'].append(offer_df['offer_success'].max())
            d['offer_id'].append(offer_df['offer_id'].unique()[0])
            d['person_id'].append(offer_df['person_id'].unique()[0])

    return d


# d = offer_success()
  
  0%|          | 0/14820 [00:00<?, ?it/s]

Save the dictionary obtained after running the above function, since it is time-consuming and inconvenient to run it every time the notebook is refreshed. We can then simply load the dictionary from storage whenever the notebook is restarted.

In [66]:
# with open('../data/cleaned/offer_success.pkl', 'wb') as f:
#     pickle.dump(d, f)
        
with open('../data/cleaned/offer_success.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)
In [67]:
df = pd.DataFrame.from_dict(loaded_dict)
df.head()
Out[67]:
offer_success offer_id person_id
0 1 5a8bc65990b245e5a138643cd4eb9837 0009655768c64bdeb2e877511632db8f
1 0 3f207df678b143eea3cee63160fa8bed 0009655768c64bdeb2e877511632db8f
2 0 f19421c1d4aa40978ebb69ca19b0e20d 0009655768c64bdeb2e877511632db8f
3 0 fafdcd668e3743c1bb461111dcafc2a4 0009655768c64bdeb2e877511632db8f
4 0 2906b810c7d4411798c6938adc9daaa5 0009655768c64bdeb2e877511632db8f
In [68]:
df['offer_success'].value_counts()
Out[68]:
0    32549
1    22673
Name: offer_success, dtype: int64

aii. Getting unsuccessful offers¶

Now that we have identified the successful offers, we will identify the unsuccessful ones. Unsuccessful offers are those that were viewed but not acted upon, or that were completed after the time limit had passed.

First, I will define a function that returns all offers completed after the specified time limit; these offers are deemed failed. Then we will take the remaining customer-offer pairs and develop a function to retrieve the offers that were viewed but never completed.

In [69]:
ineffective_offers = df[df['offer_success'] == 0]
effective_offers = df[df['offer_success'] == 1]

Approach: I'll combine the events of each customer-offer pair into a single string to find the unsuccessful offers. Then I'll search the combined event string for the following substrings:

  1. offer received -> offer viewed -> offer received
  2. offer received -> offer viewed

Substring 1 matches when the customer received the same offer more than once, whereas substring 2 matches when the customer received the offer only once, or when it was the last offer received and was never completed.

NOTE: The above steps are performed only on the valid offers (completed within the time limit) that were initially deemed ineffective in section ai.
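The substring rules above can be sketched on a few made-up event strings. The helper name `viewed_but_not_completed` and the sample strings are illustrative only; the notebook's own `get_group2_offers` below is the version actually used:

```python
def viewed_but_not_completed(events: str) -> int:
    """Return 1 if the combined event string shows the offer was viewed
    but never acted on, mirroring substrings 1 and 2 described above."""
    substring_1 = 'offer received,offer viewed,offer received'
    substring_2 = 'offer received,offer viewed'
    return 1 if substring_1 in events or events.endswith(substring_2) else 0

# received twice; viewed the first time, then received again -> unsuccessful
print(viewed_but_not_completed('offer received,offer viewed,offer received'))   # 1
# last offer received was viewed but never completed -> unsuccessful
print(viewed_but_not_completed('offer received,offer viewed'))                  # 1
# viewed and then completed -> not in this group
print(viewed_but_not_completed('offer received,offer viewed,offer completed'))  # 0
```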

In [90]:
def invalid_offer():
    
    """ 
    Creates a dictionaries that will return those customers
    who completed their offers after the given time limit.
    OUTPUT:
        d: A dictionary containing the offer_id, person_id, and valid_offer.
    """


    person_ids = ineffective_offers['person_id'].unique()
    d = dict()
    d['offer_invalid'], d['offer_id'], d['person_id'] = [], [], []    

    for person_id in tqdm(person_ids):
        person_df = ineffective_offers[ineffective_offers['person_id'] == person_id]
        
        # merge with portfolio column
        person_df = pd.merge(person_df, portfolio[['offer_id', 'offer_type', 'duration']],
                             on='offer_id', how='left')
        
        person_df = pd.merge(person_df, transcript_processed[['event','time', 'person_id', 'offer_id']], 
                             on=['person_id', 'offer_id'], how='inner')
        
        offer_ids = person_df['offer_id'].unique()

        for i in range(len(offer_ids)):

            offer_df = person_df[person_df['offer_id'] == offer_ids[i]]
            
            # check whether the offer was completed within the given time.
            offer_df = valid_offer(offer_df)
            
            offer_df['offer_invalid'] = np.where(offer_df['valid_offer'] == False, 1,0)
            
            # add the offer_success, offer_id, and person_id to the dictionary
            d['offer_invalid'].append(offer_df['offer_invalid'].max())
            d['offer_id'].append(offer_df['offer_id'].unique()[0])
            d['person_id'].append(offer_df['person_id'].unique()[0])            

    return d


def get_group2_offers(string, informational=False):
    
    """ 
    Get all the offers which were viewed by the customer, but no 
    further action was taken.
    INPUT:
        string: combined events for a customer,offer pair
        informational: True if the offer_type is informational, else
        False.
    OUTPUT:
        boolean variable: returns 1 if customer viewed the offer but did
        not take any action, otherwise returns 0.
    """
    
    substring_1 = 'offer received,offer viewed,offer received'
    substring_2 = 'offer received,offer viewed' 
    
    if substring_1 in string or string.endswith(substring_2):
        return 1
    
    elif informational == False:
        # invalid offer -- not completed within the given time
        if 'transaction' in string and 'offer completed' not in string:
            return 1
        else:
            return 0
    else:
        return 0



def combine_event(val_df):
    
    '''
    Combine all the events of a customer-offer pair into a single string.
    INPUT:
        val_df: dataframe containing the offers which were completed in
        due time.
    OUTPUT:
        val_df: dataframe with one row per (person_id, offer_id) pair and
        the events joined into a comma-separated string.
    '''
    
    val_df = (val_df[['person_id', 'offer_id', 'event']]
              .groupby(['person_id', 'offer_id'])['event']
              .apply(','.join)
              .reset_index())

    return val_df

Save the resulting dictionary in a pickle file so we don't have to run the above function every time we load the notebook.

In [91]:
# d = invalid_offer()

# with open('../data/cleaned/invalid_offer.pkl', 'wb') as f:
#     pickle.dump(d, f)
In [92]:
# load the dictionary containing the valid and invalid offers
with open('../data/cleaned/invalid_offer.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)

ineffective_df = pd.DataFrame.from_dict(loaded_dict)
In [98]:
## get the valid offers to check for customers who did not complete the offer after viewing it.
valid_offers = ineffective_df[ineffective_df['offer_invalid']==0]

## merge the transactional records for each customer
valid_offers = pd.merge(transcript_processed, valid_offers.iloc[:,1:], 
                        on=['offer_id', 'person_id'], how='inner')

## add offer type, given the offer ID
offertype = portfolio[['offer_id', 'offer_type']]
valid_offers = pd.merge(valid_offers, offertype, on='offer_id', how='inner')
In [99]:
## split the data: one part containing the informational offer type, the other containing the remaining offer types.
# also combine the events of each pair into a single string.
non_informational = combine_event(valid_offers[valid_offers['offer_type'] != 'informational'])
informational = combine_event(valid_offers[valid_offers['offer_type'] == 'informational'])
In [100]:
# Finally getting the customers who didn't complete their offer.
non_informational['unsuccessful'] = non_informational['event'].apply(lambda x: get_group2_offers(x))

informational['unsuccessful'] = informational['event'].apply(
    lambda x: get_group2_offers(x, informational=True))
In [101]:
# concatenate the different offer types, and the invalid offers together.
# introduce a variable 'offer success' and give it the value 0, before concatenating this table with the 
# successful offers
invalid_offers = ineffective_df[ineffective_df['offer_invalid']==1]
unsuccessful_df = pd.concat([non_informational[non_informational['unsuccessful']==1][['person_id', 'offer_id']], 
                             informational[informational['unsuccessful']==1][['person_id', 'offer_id']],
                             invalid_offers[['person_id', 'offer_id']]],axis=0).iloc[:,:2].reset_index(drop=True)

unsuccessful_df['offer_success'] = 0
In [102]:
df = pd.concat([effective_offers, unsuccessful_df]).sample(frac=1).reset_index(drop=True)
df.head()
Out[102]:
offer_success offer_id person_id
0 1 f19421c1d4aa40978ebb69ca19b0e20d 18f6d43ce45c41b398606753c3412f60
1 0 3f207df678b143eea3cee63160fa8bed dce9bfaf37e64b75b5c40866cfe71c80
2 0 0b1e1539f2cc45b7b9fa7c272da2e1d7 5113f5c8e57b4a0d9b42d4919562abd0
3 1 f19421c1d4aa40978ebb69ca19b0e20d ac18d9e07ed24f9ebd269082390bedb5
4 1 2906b810c7d4411798c6938adc9daaa5 055eab487b874f38b65f3edf31b5eb41

aiii. Getting all the customers who did not view the offer¶

Now we identify all the customers who did not view the offer. There are two groups of customers in this category:

  • Group 3 - Customers who completed the offer without any knowledge of it.
  • Group 4 - Customers who did not complete the offer.

These customer-offer pairs will not be used to train the machine learning model, but we will use them to analyse the customer profile and reduce the cost of sending out offers. Sending these customers offers is pointless: Group 3 consumers will complete the offer whether or not they receive it, and Group 4 customers will not make the transaction because they never view the offer. Group 4 customers could instead be targeted through other channels where they are more likely to view the offer.

In the following function, we identify Group 3 and Group 4 customers from the aforementioned dataframe, among those who did not fall into Group 2. Group 4 customers are easy to identify because they have no events other than 'offer received'. Group 3 customers, on the other hand, can be found by searching for the 'offer completed' event: since we have already extracted the successful offers, the remaining customers who completed the offer must belong to Group 3. For informational offers, we look for a 'transaction' event instead.
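As a rough stand-alone sketch of those rules, run on made-up event strings (the function name and samples are hypothetical; this sketch returns 0 where the notebook's classifier falls through without a match):

```python
def classify_unviewed(events: str, informational: bool = False) -> int:
    """Classify a customer-offer pair whose offer was never viewed."""
    # the event that counts as 'completing' differs by offer type
    done = 'transaction' if informational else 'offer completed'
    if done in events:
        return 3  # completed without ever viewing the offer (Group 3)
    if 'offer viewed' not in events and 'transaction' not in events:
        return 4  # no action at all after 'offer received' (Group 4)
    return 0      # everything else falls outside Groups 3 and 4

print(classify_unviewed('offer received,transaction,offer completed'))      # 3
print(classify_unviewed('offer received'))                                  # 4
print(classify_unviewed('offer received,transaction', informational=True))  # 3
```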

In [103]:
def grp3_4(string, informational=False):
    
    """ 
    Get all offers which were not even viewed.
    INPUT:
        string: combined events for a customer,offer pair
        informational: True if the offer_type is informational, else
        False.
    OUTPUT:
        integer: returns 3 if the customer completed the offer without
        viewing it, 4 if no further action was taken, and None otherwise.
    """

    
    if informational == False:
        # customers who completed the offer w/o viewing it.
        if 'offer completed' in string:
            return 3
        # customers who took no action after receiving the offer.
        elif 'offer viewed' not in string and 'transaction' not in string:
            return 4

    else:
        # customers who made the transaction w/o viewing the offer.
        if 'transaction' in string:
            return 3
        elif 'offer viewed' not in string:
            return 4
In [104]:
# getting all the customer-offer pairs who did not view the offer, for both informational and
# non-informational offer types.
grp3_4_I = informational[informational['unsuccessful'] == 0]
grp3_4_BD = non_informational[non_informational['unsuccessful'] == 0]
In [105]:
# Differentiate b/w the customers who completed the offer w/o viewing it (3)
# and those who didn't view the offer and took no further action (4).
grp3_4_BD['group'] = grp3_4_BD['event'].apply(lambda x: grp3_4(x))
grp3_4_I['group'] = grp3_4_I['event'].apply(lambda x: grp3_4(x, informational=True))
In [106]:
# concatenate both offer types
grp3_4_df = pd.concat([grp3_4_BD, grp3_4_I]).drop('unsuccessful',axis=1)
grp3_4_df.head()
Out[106]:
person_id offer_id event group
0 0009655768c64bdeb2e877511632db8f 2906b810c7d4411798c6938adc9daaa5 offer received,transaction,offer completed,tra... 3
1 0009655768c64bdeb2e877511632db8f f19421c1d4aa40978ebb69ca19b0e20d offer received,transaction,offer completed,off... 3
2 0009655768c64bdeb2e877511632db8f fafdcd668e3743c1bb461111dcafc2a4 offer received,transaction,offer completed,off... 3
3 0020c2b971eb4e9188eac86d93036a77 ae264e3637204a6fb9bb56bc8210ddfd offer received 4
4 003d66b6608740288d6cc97a6903f4f0 0b1e1539f2cc45b7b9fa7c272da2e1d7 offer received,transaction,offer completed 3

Exploring Group 3 Customers

In [107]:
grp3_df = grp3_4_df[grp3_4_df['group']==3][['person_id', 'offer_id']]
grp3_df = pd.merge(grp3_df, portfolio, on='offer_id', how='inner')
grp3_df = pd.merge(grp3_df, profile, on='person_id', how='inner')
grp3_df.head()
Out[107]:
person_id offer_id offer_num offer_type duration difficulty reward mobile email web social gender age income membership_month membership_year membership_day member_since age_group
0 0009655768c64bdeb2e877511632db8f 2906b810c7d4411798c6938adc9daaa5 9 discount 7 10 2 1 1 1 0 M 33 72000.0 4 2017 21 1913 (30, 40]
1 0009655768c64bdeb2e877511632db8f f19421c1d4aa40978ebb69ca19b0e20d 8 bogo 5 5 5 1 1 1 1 M 33 72000.0 4 2017 21 1913 (30, 40]
2 0009655768c64bdeb2e877511632db8f fafdcd668e3743c1bb461111dcafc2a4 6 discount 10 10 2 1 1 1 1 M 33 72000.0 4 2017 21 1913 (30, 40]
3 00715b6e55c3431cb56ff7307eb19675 2906b810c7d4411798c6938adc9daaa5 9 discount 7 10 2 1 1 1 0 F 58 119000.0 12 2017 7 1683 (50, 60]
4 00715b6e55c3431cb56ff7307eb19675 0b1e1539f2cc45b7b9fa7c272da2e1d7 4 discount 10 20 5 0 1 1 0 F 58 119000.0 12 2017 7 1683 (50, 60]
In [108]:
plot_hist(grp3_df, ['income', 'age'])
grp3_df.describe()
Out[108]:
offer_num duration difficulty reward mobile email web social age income membership_month membership_year membership_day member_since
count 10533.000000 10533.000000 10533.000000 10533.000000 10533.000000 10533.0 10533.000000 10533.000000 10533.000000 10533.000000 10533.000000 10533.000000 10533.000000 10533.000000
mean 4.394854 6.780404 8.220735 3.700275 0.825881 1.0 0.854078 0.365043 54.892528 67157.599924 6.740720 2016.451628 15.945410 2034.596601
std 2.746624 2.233669 6.549410 3.016488 0.379230 0.0 0.353045 0.481465 17.728623 22771.554907 3.520394 1.175964 8.709264 411.946622
min 0.000000 3.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000 18.000000 30000.000000 1.000000 2013.000000 1.000000 1452.000000
25% 2.000000 5.000000 5.000000 2.000000 1.000000 1.0 1.000000 0.000000 43.000000 49000.000000 4.000000 2016.000000 8.000000 1706.000000
50% 4.000000 7.000000 7.000000 5.000000 1.000000 1.0 1.000000 0.000000 56.000000 65000.000000 7.000000 2017.000000 16.000000 1949.000000
75% 7.000000 7.000000 10.000000 5.000000 1.000000 1.0 1.000000 1.000000 67.000000 83000.000000 10.000000 2017.000000 24.000000 2304.000000
max 9.000000 10.000000 20.000000 10.000000 1.000000 1.0 1.000000 1.000000 101.000000 120000.000000 12.000000 2018.000000 31.000000 3275.000000
  • The income of Group 3 consumers is slightly higher than that of all customers: on average they earn almost $2,000 more.
  • The average age of Group 3 customers is nearly the same as that of the rest of the customers.
In [109]:
grp3_df[['email', 'social', 'web', 'mobile']].sum()
Out[109]:
email     10533
social     3845
web        8996
mobile     8699
dtype: int64
In [110]:
plot_bar_sub(grp3_df, ['offer_num', 'offer_type', 'membership_year', 'gender'])
  • Offers 2, 3, and 4 are most often completed by customers who were never even aware of them.
  • These are mostly discount offer types.
  • Most Group 3 customers have been members since 2017.
  • There isn't much of a difference between males and females, although males are slightly more numerous.

Exploring Group 4 customers

In [111]:
grp4_df = grp3_4_df[grp3_4_df['group']==4][['person_id', 'offer_id']]
grp4_df = pd.merge(grp4_df, portfolio, on='offer_id', how='inner')
grp4_df = pd.merge(grp4_df, profile, on='person_id', how='inner')
grp4_df.head()
Out[111]:
person_id offer_id offer_num offer_type duration difficulty reward mobile email web social gender age income membership_month membership_year membership_day member_since age_group
0 0020c2b971eb4e9188eac86d93036a77 ae264e3637204a6fb9bb56bc8210ddfd 0 bogo 7 10 10 1 1 0 1 F 59 90000.0 3 2016 4 2326 (50, 60]
1 00e52682848542c3a6f59b7824e9a5c5 ae264e3637204a6fb9bb56bc8210ddfd 0 bogo 7 10 10 1 1 0 1 M 40 65000.0 10 2016 18 2098 (30, 40]
2 00ed7e22b32749cfafbfd88592d401d4 ae264e3637204a6fb9bb56bc8210ddfd 0 bogo 7 10 10 1 1 0 1 M 55 94000.0 7 2016 31 2177 (50, 60]
3 01925607d99c460996c281f17cdbb9e2 ae264e3637204a6fb9bb56bc8210ddfd 0 bogo 7 10 10 1 1 0 1 F 57 116000.0 11 2015 19 2432 (50, 60]
4 01e8739670a042b3877f8e843bdf55a7 ae264e3637204a6fb9bb56bc8210ddfd 0 bogo 7 10 10 1 1 0 1 F 54 119000.0 10 2015 24 2458 (50, 60]
In [112]:
# have a look at the channels through which they were targeted
grp4_df[['email', 'social', 'web', 'mobile']].sum()
Out[112]:
email     3664
social     644
web       3248
mobile    2787
dtype: int64
In [113]:
plot_bar_sub(grp4_df, ['email', 'social', 'web', 'mobile'])
  • All of these customers were sent the offer by email.
  • Few of them received the offer through social media. Young people spend the majority of their time on social media, so it is important to target them through social media apps.
  • The numbers of offers sent through web and mobile are almost the same.

b. Percent success of each offer¶

The percent success of an offer can be calculated by determining the total number of times the offer was successfully completed (from the aforementioned dataframe) and dividing it by the total number of times the offer was issued to consumers.

During EDA, we calculated the percent success of an offer as the ratio of the total number of completed offers to offers received. However, that approach can be misleading, since many offers were completed before the customer even viewed them; the earlier approach would have counted those as successful offers, which they are not.

Nevertheless, the results below still look very similar to what we obtained during the EDA stage.
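On a tiny made-up dataframe (the offer ids 'A'/'B' and person ids are hypothetical, not the real hashed ids), the percent-success calculation looks like this:

```python
import pandas as pd

toy = pd.DataFrame({
    'offer_id':      ['A', 'A', 'A', 'B', 'B'],
    'person_id':     ['p1', 'p2', 'p3', 'p1', 'p2'],
    'offer_success': [1, 0, 1, 0, 0],
})

# completed count / issued count, per offer
pct = toy.groupby('offer_id')['offer_success'].sum() / toy.groupby('offer_id')['person_id'].count()
print(pct.round(2).to_dict())  # {'A': 0.67, 'B': 0.0}
```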

In [114]:
percent_success = df.groupby('offer_id').sum()['offer_success'] / df.groupby('offer_id').count()['person_id'] 
percent_success = percent_success.rename('percent_success').reset_index()
portfolio_percentsuccess = pd.merge(percent_success, portfolio, on='offer_id', how='inner')
plot_bar_chart(portfolio_percentsuccess['offer_num'], y=portfolio_percentsuccess['percent_success'],
              title='Percentage of offer success')
# merge the engineered feature with the dataframe.
df = pd.merge(df, percent_success, on='offer_id', how='inner')

c. How many times was each customer sent the same offer?¶

Calculate how many times a customer was sent the same offer. This can be done by slicing the 'offer received' rows from the event column of the transcript dataset. We can then group the resulting data by person_id and offer_id to obtain the number of times a customer was sent the same offer.

Lastly, plot the results on a bar chart.
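A minimal sketch of that counting step on made-up transcript rows (the ids are hypothetical), filtering directly on the event column rather than via get_dummies, which yields the same counts:

```python
import pandas as pd

toy = pd.DataFrame({
    'person_id': ['p1', 'p1', 'p1', 'p2'],
    'offer_id':  ['A',  'A',  'B',  'A'],
    'event':     ['offer received', 'offer received', 'offer received', 'offer viewed'],
})

# keep only 'offer received' events, then count per (person_id, offer_id)
received = toy[toy['event'] == 'offer received']
counts = (received.groupby(['person_id', 'offer_id'])['event']
                  .count()
                  .rename('total_similar_offers_received')
                  .reset_index())
print(counts)
# p1 received offer A twice and offer B once; p2's row is dropped
# because 'offer viewed' is not a receipt.
```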

In [115]:
tot_offers = pd.get_dummies(transcript_processed,columns=['event'])
tot_offers = tot_offers[tot_offers['event_offer received'] == 1]
In [116]:
tot_offers = tot_offers.groupby(['person_id', 'offer_id']).count()['event_offer received'].reset_index()
tot_offers = tot_offers.rename(columns={'event_offer received': 'total_similar_offers_received'})
tot_offers.head(3)
Out[116]:
person_id offer_id total_similar_offers_received
0 0009655768c64bdeb2e877511632db8f 2906b810c7d4411798c6938adc9daaa5 1
1 0009655768c64bdeb2e877511632db8f 3f207df678b143eea3cee63160fa8bed 1
2 0009655768c64bdeb2e877511632db8f 5a8bc65990b245e5a138643cd4eb9837 1
In [117]:
tot_offers_count = tot_offers['total_similar_offers_received'].value_counts()
x = tot_offers_count.index
y = tot_offers_count
plot_bar_chart(x, y)

d. How many times was each customer sent any offer?¶

Finally, I'll determine how many times each customer received any offer. This could be a useful feature for our machine learning model, because it makes sense to send repeated offers to a consumer who has already completed previous offers.

In [118]:
totoffer_by_person = tot_offers.groupby('person_id').sum().reset_index()
totoffer_by_person.columns = ['person_id', 'total_offers_received']

totoffer_by_person_count = totoffer_by_person['total_offers_received'].value_counts()
x = totoffer_by_person_count.index
y = totoffer_by_person_count
plot_bar_chart(x,y)
totoffer_by_person.head(3)
Out[118]:
person_id total_offers_received
0 0009655768c64bdeb2e877511632db8f 5
1 0011e0d4e6b944f998e987f904e8c1e5 5
2 0020c2b971eb4e9188eac86d93036a77 5

Merge with dataframe created above to include the engineered features.

In [119]:
df = pd.merge(df, tot_offers, on=['offer_id','person_id'], how='inner')
df = pd.merge(df, totoffer_by_person, on='person_id', how='inner')
df.head(3)
Out[119]:
offer_success offer_id person_id percent_success total_similar_offers_received total_offers_received
0 1 f19421c1d4aa40978ebb69ca19b0e20d 18f6d43ce45c41b398606753c3412f60 0.587763 1 6
1 0 3f207df678b143eea3cee63160fa8bed 18f6d43ce45c41b398606753c3412f60 0.464761 1 6
2 1 4d5c57ea9a6940dd891ad53e9dbe8da0 18f6d43ce45c41b398606753c3412f60 0.459239 1 6
In [120]:
## save the required csv files for model implementation
In [121]:
df.to_csv('../data/cleaned/feat_engineered.csv', index=False)

profile.drop('age_group', axis=1).to_csv('../data/cleaned/profile.csv', index=False)
portfolio.to_csv('../data/cleaned/portfolio.csv', index=False)

transcript_processed.to_csv('../data/cleaned/transcript_processed.csv', index=False)

Conclusion¶

  • The average age of a Starbucks customer is almost 55 years.
  • The offer with the highest overall success percentage is offer number 6. If Starbucks wants to focus on a single offer, it should unquestionably be number 6.
  • The best offer for people under 30 is offer number 5.
  • The best offer for people over 30 is offer number 6.
  • The best offers for the elderly are numbers 1 and 8.
  • The best offer for both male and female customers is offer number 6.
  • The majority of customers received between four and five offers.
  • Almost 75% of customers received the same offer only once.
In [ ]: